Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives
We consider an agent involved in an online Markov decision process who receives a vector of outcomes every round. The agent aims to simultaneously optimize multiple objectives associated with the multi-dimensional outcomes. Due to state transitions, it is challenging to balance the vectorial outcomes to achieve near-optimality. In particular, contrary to the single-objective case, stationary policies are generally sub-optimal. We propose a no-regret algorithm based on the Frank-Wolfe algorithm (Frank and Wolfe 1956), UCRL2 (Jaksch et al. 2010), and a crucial and novel gradient threshold procedure. The procedure carefully delays gradient updates and returns a non-stationary policy that diversifies the outcomes to optimize the objectives.
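To make the abstract's algorithmic recipe concrete, here is a minimal sketch of a Frank-Wolfe-style loop with thresholded (delayed) gradient updates, as described above. This is an illustration, not the paper's algorithm: the names `f_grad`, `solve_scalarized_policy`, `run_episode`, and `threshold` are hypothetical placeholders, and the sketch omits the UCRL2 confidence sets and episode-stopping rules.

```python
import numpy as np

def frank_wolfe_rl_sketch(f_grad, solve_scalarized_policy, run_episode,
                          d, horizon, threshold=0.1):
    """Hypothetical sketch: Frank-Wolfe-style scalarization with delayed
    gradient updates; not the paper's exact algorithm."""
    x = np.zeros(d)       # running average of the outcome vectors
    t = 0                 # rounds elapsed so far
    g = f_grad(x)         # gradient of the concave objective at x
    while t < horizon:
        # Scalarize the vectorial reward with the current gradient and
        # plan (UCRL2-style, optimistically) for that scalar reward.
        policy = solve_scalarized_policy(g)
        # Follow the policy for one episode; outcomes has shape (steps, d).
        outcomes, steps = run_episode(policy)
        x = (t * x + outcomes.sum(axis=0)) / (t + steps)
        t += steps
        # Gradient threshold procedure: only recompute the gradient (and
        # hence switch policy) once it has drifted far enough, so each
        # policy is played long enough for its gain to take effect.
        new_g = f_grad(x)
        if np.linalg.norm(new_g - g) > threshold:
            g = new_g
    return x
```

The resulting sequence of policies is non-stationary by design, which matches the abstract's point that stationary policies are generally sub-optimal here.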
Reviews: Regret Minimization for Reinforcement Learning with Vectorial Feedback and Complex Objectives
Summary: This paper studies a generalization of online reinforcement learning (in the infinite-horizon undiscounted setting with finite state and action spaces and a communicating MDP) in which the agent aims to maximize a certain type of concave function of the rewards (extended to global concave functions in the appendix). More precisely, every time an action "a" is played in state "s", the agent receives a vector of rewards V(s,a) (instead of a scalar reward r(s,a)) and tries to maximize a concave function of the empirical average of the vectorial outcomes. This problem is very general and models a wide variety of settings, ranging from multi-objective optimization in MDPs to maximum-entropy exploration and online learning in MDPs with knapsack constraints. In Section 2 the authors introduce the necessary background and formalize the notions of "optimal gain" and "regret" in this setting. Defining the "optimal gain" (called the "offline benchmark" in the paper) is not straightforward.
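One natural way to formalize the objective and regret sketched in this summary is the following (an assumption for illustration: outcomes lie in a bounded set, and the benchmark is the best achievable long-run average outcome; the paper's exact "offline benchmark" may be defined differently):

```latex
% Assumed formalization, not the paper's verbatim definitions:
% V(s_t, a_t) \in [0,1]^d is the outcome vector received at round t,
% f is the concave objective, and \mathcal{X} is the set of long-run
% average outcome vectors achievable by (possibly non-stationary)
% policies in the communicating MDP.
\[
  \bar{V}_T = \frac{1}{T}\sum_{t=1}^{T} V(s_t, a_t),
  \qquad
  \mathrm{opt} = \max_{x \in \mathcal{X}} f(x),
  \qquad
  \mathrm{Reg}(T) = \mathrm{opt} - f\bigl(\bar{V}_T\bigr).
\]
```

The subtlety the reviewer points to is in defining $\mathcal{X}$: because the function is applied to the average outcome rather than summed per round, the benchmark must range over non-stationary behaviors, not just stationary policies.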
Two out of three reviewers appreciated the contributions of this paper, with one expert reviewer praising almost every aspect of the paper. On the negative side, one reviewer took issue with the proposed setting, highlighting that the utility of the proposed objective function is somewhat dubious in the general context of multi-objective decision making. I agree with this reviewer that having "multi-objective" in the title of the paper may set the wrong expectations for some readers, and I suggest that the authors consider changing the title for the final version to avoid such misunderstandings. Furthermore, the final version should discuss the relationship between this paper and the very recent work of Rosenberg and Mansour (2019), which studies essentially the same problem in episodic MDPs. Other than these concerns, the paper is worthy of being published without major changes.